ABSTRACT


The desire to predict the effort in developing or explain the
quality of software has led to the proposal of several metrics in
the literature. As a step toward validating these metrics, the
Software Engineering Laboratory has analyzed the Software Science
metrics, cyclomatic complexity and various standard program
measures for their relation to 1) effort (including design through
acceptance testing), 2) development errors (both discrete and
weighted according to the amount of time to locate and fix) and
3) one another. The data investigated are collected from a
production FORTRAN environment and examined across several projects
at once, within individual projects and by individual programmers
across projects, with three effort reporting accuracy checks
demonstrating the need to validate a database. When the data
come from individual programmers or certain validated projects,
the metrics' correlations with actual effort seem to be
strongest. For modules developed entirely by individual programmers,
the validity ratios induce a statistically significant ordering
of several of the metrics' correlations. When comparing the
strongest correlations, neither Software Science's E metric,
cyclomatic complexity nor source lines of code appears to relate
convincingly better with effort than the others.
I. Introduction


Several metrics based on characteristics of the software
product have appeared in the literature. These metrics attempt
to predict the effort in developing or explain the quality of
that software [11], [17], [19], [23]. Studies have applied them
to data from various organizations to determine their validity 
and appropriateness [1], [13], [15]. However, the question of 
how well the various metrics really measure or predict effort or 
quality is still an issue in need of confirmation. Since 
development environments and types of software vary, individual 
studies within organizations are confounded by variations in the 
predictive powers of the metrics. Studies across different 
environments will be needed before this question can be answered 
with any degree of confidence. 

Among the most popular metrics have been the Software Sci- 
ence metrics of Halstead [19] and the cyclomatic complexity 
metric of McCabe [23]. The Software Science E metric attempts to 
quantify the complexity of understanding an algorithm. 
Cyclomatic complexity has been applied to establish quality 
thresholds for programs. Whether these metrics relate to the con- 
cepts of effort and quality depends on how these factors are 
defined and measured. The definition of effort employed in this
paper is the amount of time required to produce the software
product (the number of man-hours programmers and managers spent from
the beginning of functional design to the end of acceptance
testing). One aspect of software quality is the number of errors
reported during the product's development, and this is the
measure associated with quality for this study.

Regarding a metric evaluation, there are several issues that
need to be addressed. How well do the various metrics predict or
explain these measures of effort and quality? Does the
correspondence increase with greater accuracy of effort and error
reporting? How do these metrics compare in predictive power to simpler
and more standard metrics, such as lines of source code or the
number of executable statements? These questions deal with the
external validation of the metrics. More fundamental questions
exist dealing with the internal validation or consistency of the
metrics. How well do the estimators defined actually relate to
the Software Science metrics? How do the Software Science
metrics, the cyclomatic complexity metric and the more
traditional metrics relate to one another? In this paper, both sets
of issues are addressed. The analysis examines whether the given
family of metrics is internally consistent and attempts to
determine how well these metrics really measure the quantities that
they theoretically describe.

One goal of the Software Engineering Laboratory [6], [7],
[8], [10], a joint venture between the University of Maryland,
NASA/Goddard Space Flight Center and Computer Sciences
Corporation, has been to provide an experimental database for examining
these relationships and providing insights into the answering of
such questions.
The software comprising the database is ground support
software for satellites. The systems analyzed consist of 51,000
to 112,000 lines of FORTRAN source code and took between 6900 and
22,300 man-hours to develop over a period of 9 to 21 months.
There are from 200 to 600 modules (e.g., subroutines) in each
system and the staff size ranges from 8 to 23 people, including
the support personnel. While anywhere from 10 to 61 percent of
the source code is modified from previous projects, this analysis
focuses on just the newly developed modules.

The next section discusses the data collection process and 
some of the potential problems involved. The third section
defines the metrics and interprets the counting procedure used in 
their calculation. In the fourth section, the Software Science 
metrics are correlated with their estimators and related to more 
primitive program measures. Finally, the fifth section deter- 
mines how well this collection of volume and complexity metrics 
corresponds to actual effort and developmental errors. 

II. The Data

The Software Engineering Laboratory collects data that deal 
with many aspects of the development process and product. Among 
these data are the effort to design, code and test the various 
modules of the systems as well as the errors committed during 
their development. The collected data are analyzed to provide 
insights into software development and to study the effect of
various factors on the process and product. Unlike the typical
controlled experiments where the projects tend to be smaller and
the data collection process dominates the development process,
the major concern here is the software development process, and
the data collectors must cause minimal interference to the
developers.

This creates potential problems with the validity of the
data. For example, suppose we are interested in the effort
expended on a particular module and one programmer forgets to
turn in his weekly effort report. This can cause erroneous data
for all modules the programmer may have worked on that week.
Another problem is how a programmer should report time on the
integration testing of three modules. Does he charge the time to
the parent module of all three, even though that module may be
just a small driver? That is clearly easier to do than to
apportion the effort among all three modules he has worked on.
Another issue is how to count errors. An error that is limited to
one module is easy to assign. What about an error that required
the analysis of ten modules to determine that it affects changes
in three modules? Does the programmer associate one error with
all ten modules, an error with just the three modules or one
third of an error with each of the three?* The larger the system
the more complicated the association. All this assumes that all
the errors are reported. It is common for programmers not to
report clerical errors because the time to fill out the error
report form might take longer than the time to fix the error.
These subtleties exist in most observation processes and must be
addressed in a fashion that is consistent and appropriate for the
environment.

* Efforts [18], [21] have attempted to make this assignment
scheme more precise by the explanation: a "fault" is a specific
manifestation in the source code of a programmer "error"; due to
a misconception or document discrepancy, a programmer commits an
"error" that can result in several "faults" in the program. With
this interpretation, what are referred to as errors in this study
should probably be called faults. In the interest of consistency
with previous work and clarity, however, the term error will be
used throughout the paper.


The data discussed in this paper are extracted from several
sources. Effort data were obtained from a Component Status
Report that is filled out weekly by each programmer on the
project. They report the time they spend on each module in the
system partitioned into the phases of design, code and test, as well
as any other time they spend on work related to the project,
e.g., documentation, meetings, etc. A module is defined as any
named object in the system; that is, a module is either a main
procedure, block data, subroutine or function. The Resource
Summary Form, filled out weekly by the project management,
represents accounting data and records all time charged to the
project for the various personnel, but does not break effort down
on a module basis. Both of these effort reports are utilized in
Section V of this paper to validate the effort reporting on the
modules. The errors are collected from the Change Report Forms
that are completed by a programmer each time a change is made to
the system. While the collection of effort and error data is a
subjective process and done manually, the remainder of the
software measures are objective and their calculation is
automated.


A static code analyzing program called SAP [25]
automatically computes several of the metrics examined in this analysis.
On a module basis, the SAP program determines the number of
source and executable statements, the cyclomatic complexity, the
primitive Software Science metrics and various other volume and
complexity related measures. Computer Sciences Corporation
developed SAP specifically for the Software Engineering
Laboratory and the program has been recently updated [14] to
incorporate a more consistent and thorough counting scheme of the
Software Science parameters. In an earlier study, Basili and
Phillips [3] employed the preliminary version of SAP in a related
analysis. The next section explains the revised counting
procedure and defines the various metrics.

III. Metric Definition

In the application of each of the metrics, there exist
various ways to count each of the entities. This section interprets
the counting procedure used by the updated version of SAP and 
defines each of the metrics examined in the analysis. These 
definitions are given relative to the FORTRAN language, since 
that is the language used in all the projects studied here. The 
counting scheme depends on the syntactic analysis performed by 
SAP and is, therefore, not necessarily chosen to coincide exactly 
with other definitions of the various counts. 
Primitive Software Science metrics. Software Science
defines the vocabulary metric n as the sum of the number of
unique operators n1 and the number of unique operands n2. The
operators fall into three classes.

i) Basic operators include

     +  -  *  /  **  =  ( )  &  //  .NE.  .EQ.  .LE.  .LT.
     .GE.  .GT.  .AND.  .OR.  .XOR.  .NOT.  .EQV.  .NEQV.

ii) Keyword operators include

     IF() THEN                  /* logical if */
     IF() THEN ELSE             /* logical if-then-else */
     IF() , ,                   /* arithmetic if */
     IF() THEN ENDIF            /* block if */
     IF() THEN ELSE ENDIF       /* block if-then-else */
     IF() THEN
       ELSEIF() THEN
       ... ENDIF                /* case if */
     DO                         /* do loop */
     DOWHILE                    /* while loop */
     GOTO <target>              /* unconditional goto: distinct
                                   targets imply different operators */
     GOTO (T1...Tn) <expr>      /* computed goto: different number of
                                   targets imply different operators */
     GOTO <ident>, (T1...Tn)    /* assigned goto: distinct identifiers
                                   imply different operators */
     <subr>( , ,*<target>)      /* alternate return */
     END=                       /* read/write option */
     ERR=                       /* read/write option */
     ASSIGNTO                   /* target assignment */
     EOS                        /* implicit statement delimiter */

iii) Special operators consist of the names of subroutines,
     functions and entry points.


Operands consist of all the variable names and constants. Note
that the major differences of this counting scheme from that used
by Basili and Phillips [3] are in the way goto and if statements
are counted.


The metric n* represents the potential vocabulary, and
Software Science defines it as the sum of the minimum number of
operators n1* and the minimum number of operands n2*. The
potential operator count n1* is equal to two; that is, n1* equals one
grouping operator plus one subroutine/function designator. In
this paper, the potential operand count n2* is equal to the sum
of the number of variables referenced from common blocks, the
number of formal parameters in the subroutine and the number of
additional arguments in entry points.

Source lines. This is the total number of source lines that
appear in the module, including comments and any data statements
while excluding blank lines.

Source lines - comments. This is the difference between the
number of source lines and the number of comment lines.

Executable statements. This is the number of FORTRAN
executable statements that appear in the program.

Cyclomatic complexity. Cyclomatic complexity is defined as
the number of partitions of the space in a module's
control-flow graph. For programs with unique entry and exit
nodes, this metric is equivalent to one plus the number of
decisions and, in this work, is equal to one plus the sum of the
following constructs: logical if's, if-then-else's, block if's,
block if-then-else's, do loops, while loops, AND's, OR's, XOR's,
EQV's, NEQV's, twice the number of arithmetic if's, n - 1
decision counts for a computed goto with n statement labels and n
decision counts for a case if with n predicates.

A variation on this definition excludes the counts of AND's,
OR's, XOR's, EQV's and NEQV's (later referred to as
Cyclo_cmplx_2).
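
The counting rule for cyclomatic complexity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the construct names used as dictionary keys and the function signature are hypothetical and do not correspond to actual SAP output fields.

```python
# Illustrative sketch of the cyclomatic complexity count described in the text.
# The construct names below are hypothetical labels, not SAP field names.

LOGICAL_OPERATORS = ("and", "or", "xor", "eqv", "neqv")
SIMPLE_DECISIONS = ("logical_if", "if_then_else", "block_if",
                    "block_if_then_else", "do_loop", "while_loop")


def cyclomatic_complexity(counts, computed_goto_labels=(),
                          case_if_predicates=(),
                          include_logical_operators=True):
    """Return one plus the number of decisions in a single module.

    counts: mapping from construct name to number of occurrences.
    computed_goto_labels: number of statement labels in each computed goto.
    case_if_predicates: number of predicates in each case if.
    include_logical_operators: False yields the Cyclo_cmplx_2 variation.
    """
    decisions = sum(counts.get(name, 0) for name in SIMPLE_DECISIONS)
    decisions += 2 * counts.get("arithmetic_if", 0)       # counted twice
    decisions += sum(n - 1 for n in computed_goto_labels)  # n - 1 per goto
    decisions += sum(case_if_predicates)                   # n per case if
    if include_logical_operators:
        decisions += sum(counts.get(name, 0) for name in LOGICAL_OPERATORS)
    return 1 + decisions


# Hypothetical module: 3 logical ifs, 2 do loops, 1 arithmetic if,
# 2 ANDs, 1 OR and one computed goto with 4 statement labels.
module = {"logical_if": 3, "do_loop": 2, "arithmetic_if": 1, "and": 2, "or": 1}
cyclo_cmplx = cyclomatic_complexity(module, computed_goto_labels=[4])
cyclo_cmplx_2 = cyclomatic_complexity(module, computed_goto_labels=[4],
                                      include_logical_operators=False)
print(cyclo_cmplx, cyclo_cmplx_2)  # prints "14 11"
```

In the hypothetical module the three logical operators account for the whole difference between the two variants.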

Calls. This is the number of subroutine and function
invocations in the module.

Calls and jumps. This is the total number of calls and
decisions as they are defined above.

Revisions. This is the number of versions of the module
that are generated in the program library.

Changes. This is the total number of changes to the system
that affected this module. Changes are classified into the
following types (a single change can be of more than one type):

     a. error correction
     b. planned enhancement
     c. implement requirements change
     d. improve clarity
     e. improve user service
     f. debug statement insertion/deletion
     g. optimization
     h. adapt to environment change
     i. other


Weighted changes. This is a measure of the total amount of
effort spent making changes to the module. A programmer reports
the amount of effort to actually implement a given change by
indicating either

     a. less than one hour,
     b. one hour to a day,
     c. one day to three days or
     d. over three days.

The respective means of these durations, 0.5, 4.5, 16 and 32
hours, are divided equally among all modules affected by the
change. The sum of these effort portions over all changes
involving a given module defines the weighted changes for the
module.
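
As a minimal sketch of this apportioning rule, assuming each change report is available as a pair of an effort category and the list of modules it affected, the weighted changes could be accumulated as follows. The module names and the data layout are hypothetical; only the four hour values come from the text.

```python
from collections import defaultdict

# Hours charged for each reported effort category (values from the text).
CHANGE_HOURS = {"a": 0.5, "b": 4.5, "c": 16.0, "d": 32.0}


def weighted_changes(change_reports):
    """Return a dict mapping module name to weighted changes in hours.

    change_reports: iterable of (category, modules_affected) pairs, where
    category is one of "a".."d" and modules_affected is a non-empty list.
    """
    totals = defaultdict(float)
    for category, modules in change_reports:
        share = CHANGE_HOURS[category] / len(modules)  # divided equally
        for module in modules:
            totals[module] += share
    return dict(totals)


# Hypothetical reports: one change of "one hour to a day" touching three
# modules and one change of "less than one hour" touching a single module.
reports = [("b", ["ORBIT", "ATTIT", "TELEM"]), ("a", ["ORBIT"])]
result = weighted_changes(reports)
print(result["ORBIT"], result["ATTIT"])  # prints "2.0 1.5"
```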

Errors. This is the total number of errors reported by
programmers; i.e., the number of system changes that listed this
module as involved in an error correction. (See the footnote in
Section II regarding the usage of the term "error".)

Weighted errors. This is a measure of the total amount of
effort spent isolating and fixing errors in a module. For error
corrections, a programmer also reports the amount of effort spent
isolating the error by indicating either

a. less than one hour, 

b. one hour to one day, 

c. more than one day or 

d. never found. 

The representative amounts of time for these durations, 0.5, 4.5, 
16 and 32 hours, are combined with the effort to implement the 
correction (as calculated earlier) and divided equally among the 
modules changed. The sum of these effort portions over all error 
corrections involving a given module defines the weighted errors
for the module. 
IV. Internal Validation of the Software Science Metrics

The purpose of this section is to briefly define the
Software Science metrics, to see how these metrics relate to
standard program measures and to determine if the metrics are
internally consistent. That is, Software Science hypothesizes
that certain estimators of the basic parameters, such as program
length N and program level L, can be approximated by formulas
written totally in terms of the number of unique operators and
operands. Initially, an attempt is made to find correlations
between various definitions of these quantities based on the
interpretations of operators and operands given in the previous
section. Then, the family of metrics that Software Science
proposes is correlated with traditional measures of software.

Program length. Program length N is defined as the sum of
the total number of operators N1 and the total number of operands
N2; i.e., N = N1 + N2. Software Science hypothesizes that this
can be approximated by an estimator N^ that is a function of the
vocabulary, defined as

     N^ = n1 log2(n1) + n2 log2(n2).

The scatter plot appearing in Figure 1 and Pearson correlation
coefficient of .899 (p < .001; 1794 modules)* show the
relationship between N and N^ (polynomial regression rejects including a
second degree term at p = .05). Several sources [12], [16],
[26], [27] have observed that the length estimator tends to be
high for small programs and low for large programs. The
correlations and significance levels for the pairwise Wilcoxon statistic
[20], broken down by executable statements and length, are
displayed in Table 1. In our environment, either measure of size
demonstrates that N^ significantly overestimates N in the first
and second quartiles and underestimates it (most significantly)
in the fourth quartile. Feuer and Fowlkes [15] assert that the
accuracy of the relation between the natural logarithms of
estimated and observed length changes less with program size. The
scatter plot appearing in Figure 2 and correlation coefficient
for ln N vs. ln N^ of .927 (p < .001; 1794 modules) show moderate
improvement.

* The symbol p will be used to stand for significance level.
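
To make the two length definitions concrete, the following sketch computes N and the estimator N^ for a single module. The counts are hypothetical and were chosen as powers of two so that the arithmetic is easy to verify by hand; the function names are illustrative only.

```python
import math


def length(N1, N2):
    """Observed program length N = N1 + N2."""
    return N1 + N2


def estimated_length(n1, n2):
    """Length estimator N^ = n1 log2(n1) + n2 log2(n2)."""
    return n1 * math.log2(n1) + n2 * math.log2(n2)


# Hypothetical module: 16 unique operators, 32 unique operands,
# 120 total operator occurrences and 100 total operand occurrences.
n1, n2, N1, N2 = 16, 32, 120, 100
N = length(N1, N2)                  # 120 + 100 = 220
N_hat = estimated_length(n1, n2)    # 16*4 + 32*5 = 224.0
print(N, N_hat)                     # prints "220 224.0"

# The log-log comparison examined in the text works on these quantities:
ln_N, ln_N_hat = math.log(N), math.log(N_hat)
```

For this module N^ slightly overestimates N; the direction of the difference for real modules is what Table 1 examines by size quartile.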

<< Figure 1 >> 


Table 1. Observed vs. estimated length broken down by program size.

a. N vs. N^ broken down by executable statements.

XQT STMTS    MODS    R*     ESTIMATION   WILCOXON SIGNIF

 0 - 19       446   .601      over          <<.0001
20 - 40       442   .511      over          <<.0001
41 - 78       457   .478      under           .0367
79 <=         449   .751      under         <<.0001

b. N vs. N^ broken down by length.

LENGTH N     MODS    R*     ESTIMATION   WILCOXON SIGNIF

  0 - 114     449   .750      over          <<.0001
115 - 243     445   .447      over          <<.0001
244 - 512     453   .348      under           .0010
513 <=        447   .731      under         <<.0001

* (p < .001)


<< Figure 2 >> 



Program volume. A program volume metric V, defined as

     V = N log2 n,

represents the size of an implementation, which can be
thought of as the number of bits necessary to express it. The
potential volume V* of an algorithm reflects the minimum
representation of that algorithm in a language where the required
operation is already defined or implemented. The parameter V* is
a function of the number of input and output arguments of the
algorithm and is meant to be a measure of its specification. The
metric V* is defined as

     V* = (2 + n2*) log2 (2 + n2*).

The correlation coefficient for V vs. V* of .670 (p < .001; 1794
modules) shows a reasonable relationship between a program's
necessary volume and its specification.
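
A short sketch of the two volume definitions follows. The input values are hypothetical and chosen so that the logarithms are whole numbers; the function names are not taken from SAP.

```python
import math


def volume(N, n):
    """Program volume V = N log2 n (length times bits per vocabulary item)."""
    return N * math.log2(n)


def potential_volume(n2_star):
    """Potential volume V* = (2 + n2*) log2 (2 + n2*)."""
    return (2 + n2_star) * math.log2(2 + n2_star)


# Hypothetical module: length N = 220, vocabulary n = 64 and
# n2* = 30 potential operands (common variables plus parameters).
V = volume(220, 64)             # 220 * 6 = 1320.0
V_star = potential_volume(30)   # 32 * 5 = 160.0
print(V, V_star)                # prints "1320.0 160.0"
```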

Program level. The program level L for an algorithm is
defined as the ratio of its potential volume to the size of its
implementation, expressed as

     L = V*/V.

Thus, the highest level for an algorithm is its program
specification and there L has value unity. The larger the size of the
required implementation V, the lower the program level of the
implementation. Since L requires the calculation of V*, which is
not always readily obtainable, Software Science hypothesizes that
L can be approximated by

     L^ = (2/n1) (n2/N2).

The correlation for L vs. L^ of .531 (p < .001; 1794
modules) is disappointingly below that of .90 given in [19].
Hoping for an increase in the correlations, the modules are
partitioned by the number of executable statements in Table 2.
Although the upper quartiles show measured improvement over the
correlation of the whole sample, a more interesting relationship
surfaces. The level estimator significantly underestimates the
program level in the second, third and fourth quartiles, with the
hypothesis being rejected in the first quartile. The increase in
magnitude of the n2* parameter does not appear to be totally
captured by the definition of L^.

Table 2. Relationship of observed vs. estimated program level
         broken down by program size.

XQT STMTS    MODS    R*     ESTIMATION   WILCOXON SIGNIF

 0 - 19       446   .484
20 - 40       442   .672      under         <<.0001
41 - 78       457   .597      under         <<.0001
79 <=         449   .615      under         <<.0001

all          1794   .531      under         <<.0001

* (p < .001)
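
The difference between the level L and its estimator L^ is that only the former needs the potential volume. A minimal sketch, using hypothetical values for one module, makes the two computations explicit.

```python
def level(V_star, V):
    """Program level L = V*/V; requires the potential volume V*."""
    return V_star / V


def estimated_level(n1, n2, N2):
    """Level estimator L^ = (2/n1) (n2/N2); uses only operator/operand counts."""
    return (2.0 / n1) * (n2 / N2)


# Hypothetical module: V* = 160, V = 1320, 16 unique operators,
# 48 unique operands and 100 total operand occurrences.
L = level(160.0, 1320.0)               # about 0.121
L_hat = estimated_level(16, 48, 100)   # 0.125 * 0.48, about 0.06
print(L_hat < L)                       # prints "True": L^ is below L here
```

The hypothetical values were picked so that L^ falls below L, the direction reported for the upper three quartiles in Table 2; they are not drawn from the study's data.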

Program difficulty. The program difficulty D is defined as
the difficulty of coding an algorithm. The metric D and the
program level L have an inverse relationship; D is expressed

     D = 1/L.

An alternate interpretation of difficulty defines it as the
inverse of L^, given by

     D2 = 1/L^ = n1 N2 / (2 n2).

Christensen, Fitsos and Smith [12] demonstrate that the unique
operator count n1 tends to remain relatively constant with
respect to length for 490 PL/S programs. They propose that the
average operand usage N2/n2 is the main contributor to the
program difficulty D2. The scatter plot appearing in Figure 3 and
Pearson correlation coefficient of .729 (p < .001; 1794 modules)
display the relationship between N2/n2 and D2 for our FORTRAN
modules. The application of polynomial regression brings in a
second degree term (p < .001) and results in a correlation of
.738.


<< Figure 3 >>

However, after observing in Figure 4 that n1 varies with program
size, it seems that the inflation of n1 might better
explain D2. The scatter plot appearing in Figure 5 and the
correlation of .865 (p < .001; 1794 modules) show the
relationship of D2 vs. n1. Step-wise polynomial regression brings in a
second degree term initially, followed by a linear term (p <
.001), and results in a correlation of .879. In our environment,
the unique operator count n1 explains a greater proportion of the
variance of the difficulty D2 than the average operand usage
N2/n2.

<< Figure 4 >> 
<< Figure 5 >>


Program effort. The Software Science effort metric E
attempts to quantify the effort required to comprehend the
implementation of an algorithm. It is defined as the ratio of the
volume of an implementation to its level, expressed as

     E = V/L = (V)**2 / V*.

The E metric increases for programs implemented with large
volumes or written at low program levels; that is, it varies with
the square of the volume. An approximation to E can be obtained
without the knowledge of the potential volume by substituting L^
for L in the above equation. The metric

     E^ = V/L^ = n1 N2 V / (2 n2) = n1 N2 N log2 n / (2 n2)

defines the product of one half the number of unique operators,
the average operand usage and the volume. In an attempt to
remove the effect of possible program impurities [9], [19], N^ is
substituted for N in the above equation, yielding

     E^^ = N^ log2 n / L^
         = n1 N2 (n1 log2 n1 + n2 log2 n2) log2 n / (2 n2).

The correlation coefficients for E vs. E^, E vs. E^^, ln E vs. ln
E^ and ln E vs. ln E^^ are given in Table 3a. A fit of a least
squares regression line to the log-log plot of E vs. E^ produces
the equation

     ln E = .830 * ln E^ + 1.357.

Equivalently,

     E = exp(1.357) * (E^)**0.830.

Due to this non-linear relationship and the improved correlation
of ln E vs. ln E^, the modules are partitioned by executable
statements in Table 3b. The application of polynomial regression
confirms this non-linearity by bringing in a second degree term
(p < .001), resulting in a correlation of .698. In Table 3b,
notice that the correlations seem substantially better for
modules below median size. The significant overestimation in the
upper three quartiles is attributable to the relationship of L and L^
described earlier.


Table 3. Observed vs. estimated Software Science E metric.

a. Pearson correlation (p < .001; 1794 modules).

                          R

E vs. E^                .663
ln E vs. ln E^          .931
E vs. E^^               .603
ln E vs. ln E^^         .890

b. E vs. E^ broken down by executable statements.

XQT STMTS    MODS    R*     ESTIMATION   WILCOXON SIGNIF

 0 - 19       446   .708      under           .0050
20 - 40       442   .709      over          <<.0001
41 - 78       457   .411      over          <<.0001
79 <=         449   .550      over          <<.0001

* (p < .001)
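
The effort metric, its two estimators and the back-transformed regression fit can be sketched as below. The module counts are hypothetical, the function names are illustrative, and the only values taken from the text are the fitted coefficients .830 and 1.357.

```python
import math


def effort(V, V_star):
    """Effort E = V/L = V**2 / V*."""
    return V ** 2 / V_star


def estimated_effort(n1, n2, N1, N2):
    """E^ = V/L^ = n1 N2 N log2(n) / (2 n2), with N = N1 + N2, n = n1 + n2."""
    N, n = N1 + N2, n1 + n2
    return n1 * N2 * N * math.log2(n) / (2 * n2)


def estimated_effort_2(n1, n2, N2):
    """E^^: same as E^ but with the length estimator N^ in place of N."""
    n = n1 + n2
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)
    return n1 * N2 * N_hat * math.log2(n) / (2 * n2)


def effort_from_fit(E_hat):
    """Back-transform of the reported fit ln E = .830 ln E^ + 1.357."""
    return math.exp(1.357) * E_hat ** 0.830


# Hypothetical module: n1 = 16, n2 = 48, N1 = 120, N2 = 100, V* = 160.
E = effort(1320.0, 160.0)                    # 1320**2 / 160 = 10890.0
E_hat = estimated_effort(16, 48, 120, 100)   # 16*100*220*6 / 96 = 22000.0
E_hat2 = estimated_effort_2(16, 48, 100)
print(E, E_hat)                              # prints "10890.0 22000.0"
```

Because the fitted exponent is below one, the back-transformed curve grows more slowly than E^ itself, which is the non-linearity the text describes.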




Program bugs. Software Science defines the bugs metric B as
the total number of "delivered" bugs in a given implementation.
Not to be confused with user acceptance testing, the metric B is
the number of inherent errors in a system component at the
completion of a distinct phase in its development. Bugs B is
expressed by

     B = V / E0,

where E0 is theoretically equivalent to the mean number of
elementary discriminations between potential errors in programming.
Through a calculation that employs the definitions of E, L and
lambda (lambda = LV* is referred to as the language level), this
equation becomes

     B = (lambda)**(1/3) (E)**(2/3) / E0.

The derivation determines an E0 value of 3000, assumes
(lambda)**(1/3) = 1 and obtains

     B^ = (E)**(2/3) / 3000.

The correlation for B vs. B^ is .789 (p < .001; 1794 modules).
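
The step from B = V/E0 to the form involving lambda and E follows from the definitions E = V/L and lambda = LV*, which together give V = (lambda)**(1/3) (E)**(2/3). The sketch below checks that identity numerically and evaluates B and B^; the input values are hypothetical and the function names are illustrative.

```python
E0 = 3000.0  # value of E0 reported in the text


def bugs(V):
    """Bugs metric B = V / E0."""
    return V / E0


def estimated_bugs(E):
    """Bugs estimator B^ = E**(2/3) / 3000, assuming lambda**(1/3) = 1."""
    return E ** (2.0 / 3.0) / E0


# Hypothetical module with V = 1320 and V* = 160.
V, V_star = 1320.0, 160.0
L = V_star / V        # program level
lam = L * V_star      # language level lambda = L V*
E = V / L             # effort

# Identity used in the derivation: V = lambda**(1/3) * E**(2/3).
reconstructed_V = lam ** (1.0 / 3.0) * E ** (2.0 / 3.0)
print(abs(reconstructed_V - V) < 1e-6)   # prints "True"

B = bugs(V)                 # 1320 / 3000, about 0.44
B_hat = estimated_bugs(E)   # drops the lambda factor
```

Since B is a constant multiple of V, the two metrics necessarily show identical correlations with any other measure.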

In summary, the relationship of some of the Software Science
metrics with their estimators seems to be program size dependent.
Several observations lead to the result that the metric N^
significantly overestimates N for modules below the median size and
underestimates for those above the median size. The level
estimator L^ seems to have a moderate correlation with L, and its
significant underestimation of L in the upper three quartiles
reflects its failure to capture the magnitude of n2* in the
larger modules. With respect to the E metric, the effort
estimator E^ correlates better over the whole sample than E^^, and
their strongest correlations are for modules below median size.
The estimator E^ shows a non-linear relationship to the effort
metric E. The correlation of ln E vs. ln E^ significantly
improves over that of E vs. E^, with the E^ metric's
overestimation of E for larger modules attributable to the role of L^ in its
definition. With the above family of metrics, Software Science
attempts to quantify size and complexity related concepts that
have traditionally been described by a more fundamental set of
measures.

Table 4 displays the correlations of the Software Science
metrics with the classical program measures of source lines of
code, cyclomatic complexity, etc. There are several observations
worth noting. Length N and volume V have remarkably similar
correlations and correspond quite well with most of the program
measures. Several of the metrics correlate well with the number
of executable statements, especially the program "size" metrics
of N1, N2, N and V (also B). The level estimator L^ and its
inverse D2 seem to be much more related to the standard size and
complexity measures than their counterparts L and D1. The
language level lambda does not seem to show a significant
relationship to the standard size and complexity measures, as
expected. The E^^ metric relates best with the number of
executable statements and the modified cyclomatic complexity, while
correlating with all the measures better than the E metric and
slightly better than E^. None of the Software Science measures
correlate especially well with the number of revisions or the sum
of procedure and function calls. The primary measures of unique
operators n1 and unique operands n2 correspond reasonably well
overall with n2 being stronger with source lines and n1 stronger
with the cyclomatic complexities. In the next section, an
analysis attempts to determine the relationship that these
parameters really have with the quantities that they theoretically
describe.


Table 4. Comparison of Software Science metrics against more
         traditional software measures.

Key:  ?          not significant at .05 level
      *          significant at .05 level
      a          significant at .01 level
      otherwise  significant at .001 level

          Source   Execut   Source-  Cyclo_   Cyclo_   Revi-    Calls &
          Lines    Stmts    Cmmts    cmplx    cmplx_2  sions    Jumps    Calls

n1         .776     .854     .778     .796     .818     .361     .802     .542
n2         .352     .867     .853     .767     .774     .430     .809     .614
N1         .824     .964     .868     .881     .889     .328     .869     .552
N2         .826     .949     .871     .858     .870     .355     .870     .597
n2*        .792     .691     .754     .635     .629     .501     .683     .541
N          .829     .961     .873     .874     .884     .343     .874     .577
N^         .864     .897     .364     .800     .811     .420     .836     .621
V +        .837     .962     .875     .873     .883     .343     .876     .584
V*         .776     .677     .734     .618     .611     .485     .664     .525
L         -.098    -.179    -.112    -.170    -.173      ?       .158    -.083
L^        -.383    -.411    -.394    -.389    -.396    -.216    -.386    -.250
D1=1/L     .067a    .244     .113     .178     .196    -.093     .134      ?
D2=1/L^    .696     .872     .745     .816     .839     .269     .791     .478
N2/n2      .365     .544     .437     .508     .517     .106     .470     .241
Lambda     .136      ?       .108      ?        ?       .134      ?       .051*
E          .439     .629     .500     .535     .556     .106     .506     .282
E^         .663     .831     .711     .771     .797     .224     .748     .452
E^^        .738     .871     .760     .799     .829     .268     .788     .501
B +        .837     .962     .875     .873     .883     .343     .876     .584
B^         .546     .749     .610     .650     .670     .149     .620     .355

+ B and V will have identical correlations since they are linear
  functions of one another.

V. External Validation of the Software Science and Related Metrics 

The purpose of this section is to determine how well the 
Software Science metrics and various complexity measures relate 
to actual effort and errors encountered during the development of 
software in a commercial environment. These objective product 
metrics are compared against more primitive volume metrics, such 
as lines of source code. The reservoir of development data 
includes the monitoring of several projects and the analysis 
examines several projects at once, individual projects and indi- 
vidual programmers across projects. To remove the dependency of 
the distribution of the correlation coefficient on the actual 
measures of effort and errors, the nonparametric Spearman rank 
order correlation coefficients are examined in this section [22]. 
(The ability of a few data points to artificially inflate or 
deflate the Pearson product-moment correlation coefficient is 
well recognized.) The analysis first examines how well these 
measures correspond to the total effort spent in the development
of software. 
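
The remark about a few data points inflating the Pearson coefficient can be illustrated with a minimal sketch. The module sizes and effort figures below are hypothetical (they are not taken from the study), and the helper names `average_ranks`, `pearson` and `spearman` are introduced only for this illustration. One very large module dominates the product-moment coefficient, while the rank order coefficient depends only on the ordering of the modules.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank order correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: six small modules and one very large one.
source_lines = [40, 55, 60, 72, 80, 95, 1200]
effort_hours = [30, 12, 45, 20, 38, 15, 900]

r_pearson = pearson(source_lines, effort_hours)
r_spearman = spearman(source_lines, effort_hours)
print(f"Pearson r   = {r_pearson:.3f}")
print(f"Spearman Rs = {r_spearman:.3f}")
```

In this made-up sample the single large module pulls the Pearson coefficient close to one, whereas the rank order coefficient stays modest because the six small modules show no consistent ordering.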
A. Metrics' Relation to Actual Effort


Initially, a correlation across seven projects of the
Software Science E metric vs. actual effort, on a module by
module basis using only those that are newly developed, produces
the results in Table 5. The table also displays the correlations
of some of the more standard volume metrics with actual effort.
These disappointingly low correlations create a fear that there
may be some modules with poor effort reporting skewing the
analysis. Since there is partial redundancy built into the effort
data collection process, there exists hope of validating the
effort data.


Table 5. Spearman rank order correlations Rs with effort for
         all modules (731) from all projects.

Key: ?          not significant at .05 level
     *          significant at .05 level
     a          significant at .01 level
     otherwise  significant at .001 level

E               .345
E^              .445
E^^             .488
Cyclo_cmplx     .463
Cyclo_cmplx_2   .467
Calls           .414
Calls_&_Jumps   .494
D1=1/L          .126
D2=1/L^         .417

Source_Lines    .522
Execut_Stmts    .456
Source-Cmmts    .460
V               .448
N               .434
eta1            .485
eta2            .461

B               .448
B^              .345
Revisions       .531
Changes         .469
Weighted_Chg    .468
Errors          .220
Weighted_Err    .226

Validation of effort data. The partial redundancy in the
development monitoring process is that both managers and programmers
submit effort data. Individual programmers record time spent
on each module, partitioned by design, code, test and support
phases, on a weekly basis with a Component Status Report (CSR).
Managers record the amount of time every programmer spends working
each week on the project they are supervising with a Resource
Summary Form (RSF). Since the latter form possesses the enforcement
associated with the distribution of financial resources, it
is considered more accurate [24]. However, the Resource Summary
Form does not break effort down by module, and thus a combination
of the two forms has to be used.

Three different possible effort reporting validity checks
are proposed. All employ the idea of selecting programmers that
tend to be good effort reporters, and then using just the modules
that only they worked on in the metric analysis. The three proposed
effort reporting validity checks are:

a. $V_m = \dfrac{\text{number of weekly CSRs submitted by programmer}}{\text{number of weeks programmer appears on RSFs}}$

b. $V_t = \dfrac{\text{sum of all man-hours reported by programmer on all CSRs}}{\text{sum of all man-hours reported for programmer on all RSFs}}$

c. $V_i = 1 - \dfrac{\text{number of weeks programmer's CSR effort} > \text{RSF effort}}{\text{total number of weeks programmer active in project}}$

The first validity proposal attempts to capture the frequency of
the programmer's effort reporting. It checks for missing data by
ranking the programmers according to the ratio Vm of the number
of Component Status Reports submitted over the number of weeks
that the programmer appears on Resource Summary Forms. The second
validity proposal attempts to capture the total percentage of
effort reported by the programmer. This proposal ranks the programmers
according to the ratio Vt formed by the sum of all the
man-hours reported on Component Status Reports over the sum of
all hours delegated to him on Resource Summary Forms.

Note that for a given week, the amount of time reported on a
Component Status Report should always be less than or equal to
the amount of time reported on the corresponding Resource Summary
Form. This is not because the programmer fails to "cover" himself,
but a consequence of the management's encouragement for
programmers to realistically allocate their time rather than to
guess in an ad hoc manner. This observation defines a third validity
proposal to attempt to capture the frequency of a
programmer's reporting of inflated effort. This data check ranks
the programmers according to the quantity Vi equal to one minus
the ratio of the number of weeks that CSR effort reported
exceeded RSF effort over the total number of weeks that the programmer
is active in the project.
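
The three ratios can be sketched in a few lines. This is only an illustration under stated assumptions: the weekly hours are hypothetical, the function name `validity_ratios` and the dictionary layout are invented for the example, and the number of weeks a programmer is active is assumed to equal the number of weeks he appears on Resource Summary Forms.

```python
def validity_ratios(csr_hours, rsf_hours):
    """Compute (Vm, Vt, Vi) for one programmer on one project.

    csr_hours: dict week -> man-hours the programmer reported on CSRs
    rsf_hours: dict week -> man-hours the manager reported on RSFs
    """
    # Assumption: weeks active in the project = weeks appearing on RSFs.
    active_weeks = len(rsf_hours)

    # Vm: fraction of RSF weeks for which a CSR was submitted (missing data check).
    vm = len(csr_hours) / active_weeks

    # Vt: fraction of the RSF man-hours that were accounted for on CSRs.
    vt = sum(csr_hours.values()) / sum(rsf_hours.values())

    # Vi: one minus the fraction of weeks with CSR effort above RSF effort.
    inflated_weeks = sum(
        1 for week, hours in csr_hours.items() if hours > rsf_hours.get(week, 0.0)
    )
    vi = 1.0 - inflated_weeks / active_weeks
    return vm, vt, vi


# Hypothetical programmer: four weeks on RSFs, CSRs submitted for three of them.
rsf = {1: 40.0, 2: 40.0, 3: 20.0, 4: 40.0}
csr = {1: 36.0, 2: 44.0, 4: 30.0}

vm, vt, vi = validity_ratios(csr, rsf)
print(f"Vm = {vm:.1%}, Vt = {vt:.1%}, Vi = {vi:.1%}")
```

In this made-up case the programmer misses one weekly report (week 3) and over-reports in one week (week 2), so both Vm and Vi fall below 100%.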

Metrics' relation to validated effort data. Of the given
proposals, the systems development head of the institution where
the software is being developed suggests that the first proposal,
the missing data check, would be a good initial attempt to select
modules with accurate effort reporting [24]. The missing data
ratios Vm are defined for programmers on a project by project
basis. Table 6 displays the effort correlations of the newly
developed modules worked on by only programmers with Vm >= 90%
from all projects, those with Vm >= 80% and for all newly
developed modules. Most of the correlations of the modules
included in the Vm >= 90% level seem to show improvement over
those at the Vm >= 80% level. Although this is the desired effect
and several of the Vm >= 90% correlations increase over the original
values, a majority of the correlations with modules at the
Vm >= 80% level are actually lower than their original coefficients.
Since the effect of the ratio's screening of the data is
inconsistent and the overall magnitudes of the correlations are
low, the analysis now examines modules from different projects
separately.
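
The screening step described above (keeping only the modules worked on exclusively by programmers whose Vm meets a threshold) might look like the following sketch. The programmer and module identifiers and the Vm values are hypothetical, and `screen_modules` is a name chosen for this illustration; the study defines Vm per programmer and per project, which the sketch simplifies to a single project.

```python
def screen_modules(module_programmers, vm_by_programmer, threshold):
    """Return the modules whose every contributing programmer has Vm >= threshold.

    module_programmers: dict module name -> set of programmer ids
    vm_by_programmer:   dict programmer id -> Vm as a fraction (0.0 to 1.0)
    """
    kept = []
    for module, programmers in module_programmers.items():
        # A module with no recorded programmer cannot be validated, so it is dropped.
        if programmers and all(vm_by_programmer[p] >= threshold for p in programmers):
            kept.append(module)
    return sorted(kept)


# Hypothetical project data.
vm_by_programmer = {"PA": 0.95, "PB": 0.84, "PC": 0.70}
module_programmers = {
    "MOD1": {"PA"},
    "MOD2": {"PA", "PB"},
    "MOD3": {"PB", "PC"},
    "MOD4": {"PC"},
}

at_90 = screen_modules(module_programmers, vm_by_programmer, 0.90)
at_80 = screen_modules(module_programmers, vm_by_programmer, 0.80)
print("Vm >= 90%:", at_90)
print("Vm >= 80%:", at_80)
```

Raising the threshold can only shrink the set of retained modules, which is why the number of data points drops from the 80% to the 90% level.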
Table 6. Spearman rank order correlations Rs with effort for modules
         across seven projects with various validity levels.

Key: ?          not significant at .05 level
     *          significant at .05 level
     a          significant at .01 level
     otherwise  significant at .001 level

Validity ratio Vm (#mods)
                all(731)   80%(398)   90%(2…)

E               .345       .307       .357
E^              .445       .422       .467
E^^             .488       .480       .513
Cyclo_cmplx     .463       .457       .479
Cyclo_cmplx_2   .467       .454       .506
Calls           .414       .360       .402
Calls_&_Jumps   .494       .475       .479
D1=1/L          .126       .088*      ?
D2=1/L^         .417       .371       .421
Source_Lines    .522       .519       .501
Execut_Stmts    .456       .429       .475
Source-Cmmts    .460       .420       .439
V               .448       .434       .475
N               .434       .416       .460
eta1            .485       .462       .493
eta2            .461       .467       .503
B               .448       .434       .475
B^              .345       .307       .357
Revisions       .531       .580       .565
Changes         .469       .495       .385
Weighted_Chg    .468       .521       .462
Errors          .220       .381       .205
Weighted_Err    .226       .382       .247


The Spearman correlations of the various metrics with effort
for three of the individual projects appear in Table 7.
Table 7. Spearman rank order correlations Rs with effort for
         various validity rankings of modules from individual
         projects S1, S3 and S7.

Key: ?          not significant at .05 level
     *          significant at .05 level
     a          significant at .01 level
     otherwise  significant at .001 level
     z          unavailable data

Project                 S1                  S3 (1)         S7 (2)
Validity ratio Vm   all    80%    90%     80%    90%     all    80%
#modules            79     29     20      132    81      127    49

E                   .613   .647   .726    .469   .419    .285   .409a
E^                  .665   .713   .746    .602   .585    .389   .569
E^^                 .700   .747   .798    .638   .640    .430   .567
Cyclo_cmplx         .757   .774   .792    .583   .608    .463   .523
Cyclo_cmplx_2       .764   .785   .787    .609   .664    .491   .523
Calls               .681   .698   .818    .442   .492    .404   .485
Calls_&_Jumps       .776   .813   .822    .594   .619    .488   .569
D1=1/L              .262a  ?      ?       .156*  ?       ?      ?
D2=1/L^             .625   .681   .745    .507   .442    .377   .499
Source_Lines        .686   .672   .729    .743   .734    .486   .499
Execut_Stmts        .688   .709   .781    .609   .594    .408   .515
Source-Cmmts        .670   .710   .778    .671   .654    .416   .471
V                   .657   .692   .774    .627   .637    .377   .497
N                   .653   .680   .755    .613   .619    .360   .484
eta1                .683   .740   .848    .553   .533    .439   .431
eta2                .667   .701   .747    .643   .698    .365   .445
B                   .657   .692   .774    .627   .637    .377   .497
B^                  .613   .643   .726    .469   .419    .285   .409a
Revisions           .677   .717   .804    .655   .632    .449   .510
Changes             .687   .645   .760    .672   .639    .238a  .380a
Weighted_Chg        .685   .629   .749    .673   .649    .238a  .256*
Errors              z      z      z       .644   .611    .253a  .438
Weighted_Err        z      z      z       .615   .605    .245a  .276*

(1) All modules in project S3 were developed by programmers
    with Vm >= 80%.
(2) There exist fewer than a significant number of modules developed
    by programmers with Vm >= 90%.
Although the correlation coefficients vary considerably between
and among the projects, the overall improvement in projects S1
and S3 is apparent. Almost every metric's correlation with
development effort increases with the more reliable data in projects
S1 and S7. When comparing the strongest correlations from
the seven individual projects, neither Software Science's E
metrics, cyclomatic complexity nor source lines of code relates
convincingly better with effort than the others. Note that the
estimators of the Software Science E metric, E^ and E^^, appear
to show a stronger relationship to actual effort than E.

The validity screening process substantially improves the 
correlations for some projects, but not all. This observation 
points toward the existence of project dependent factors and 
interactions. In an attempt to minimize these intraproject 
effects, the analysis focuses on individual programmers across 
projects. Note that Basili and Hutchens [2] also suggest that
programmer differences have a large effect on the results when 
many individuals contribute to a project. 

The use of modules developed solely by individual programmers
significantly reduces the number of available data points
because of the team nature of commercial work. Fortunately, however,
there are five programmers who totally developed at least
fifteen modules each. The correlations for all modules developed
by them and their values of the three proposed validity ratios
are given in Table 8. The order of increasing correlation coefficients
for a particular metric can be related to the order of
increasing values for a given validity ratio using the Spearman
rank order correlation. The significance levels of these rank
order correlations for several of the metrics appear in Table 9.

Table 8. Spearman rank order correlations Rs with effort for modules
         totally developed by five individual programmers.

Key: ?          not significant at .05 level
     *          significant at .05 level
     a          significant at .01 level
     otherwise  significant at .001 level

Programmer (#mods)
                P1(31)       P2(17)   P3(21)   P4(24)   P5(15)

E               .593         ?        ?        .561a    ?
E^              .718         .526*    .375*    .555a    .507*
E^^             [illegible]  .570a    ?        .539a    .511*
Cyclo_cmplx     .592         .469*    .521a    .565a    ?
Cyclo_cmplx_2   .684         .583a    .481*    .546a    ?
Calls           .622         .787     ?        .669     ?
Calls_&_Jumps   .701         .604a    .451*    .579a    ?
D1=1/L          .314*        ?        ?        ?        ?
D2=1/L^         .713         .460*    ?        .497a    .467*
Source_Lines    .863         .682     .605a    .624     ?
Execut_Stmts    .747         .540*    .436*    .631     .534*
Source-Cmmts    .826         .576a    .530a    .612     .509*
V               .718         .540*    .453*    .579a    .451*
N               .676         .526*    .461*    .556a    .471*
eta1            .811         .575a    ?        .536a    ?
eta2            .765         .701     .527a    .597     ?
B               .718         .540*    .453*    .579a    .451*
B^              .593         ?        ?        .561a    ?
Revisions       .675         .523*    .777     .468*    ?
Changes         .412*        .468*    .600a    ?        ?
Weighted_Chg    .428a        .527*    .502a    ?        ?
Errors          .386*        ?        .668     ?        .596a
Weighted_Err    .342*        ?        .624     ?        .545*

VALIDITY RATIOS (%)

Vm              92.5         96.0     87.7     83.9     74.1
Vt              97.9         91.8     98.8     82.1     74.1
Vi              78.6         69.5     77.6     80.0     87.5
Ave. Vm,Vt      95.2         93.9     93.25    83.0     74.1
Ave. Vm,Vi      85.5         82.75    82.65    81.95    80.3

The statistically significant correspondence between the programmers'
validity ratios Vm and the correlation coefficients justifies
the use of the ratio Vm in the earlier analysis; possible
improvement is suggested if Vm were combined with either of the
other two ratios.
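
The significance levels reported in Table 9 come from ranking five programmers twice (once by validity ratio, once by a metric's effort correlation) and testing the agreement of the two rankings. The text does not state how the levels were computed, so the following is only a sketch under assumptions: it uses the common t approximation t = Rs * sqrt((n-2)/(1-Rs^2)) with n-2 = 3 degrees of freedom and a one-tailed test, and it assumes that a non-significant correlation ("?") ranks below all reported values. The Vm and Cyclo_cmplx_2 figures are those listed in Table 8; the function names are invented for the illustration.

```python
from math import atan, pi, sqrt


def ranks_no_ties(values):
    """1-based ranks in increasing order; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_no_ties(x, y):
    """Spearman Rs via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_no_ties(x), ranks_no_ties(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def one_tailed_p_df3(t):
    """Upper-tail probability of Student's t with exactly 3 degrees of freedom."""
    u = t / sqrt(3.0)
    cdf = 0.5 + (u / (1.0 + u * u) + atan(u)) / pi  # closed form for df = 3
    return 1.0 - cdf


# Table 8 values for programmers P1..P5.
vm = [92.5, 96.0, 87.7, 83.9, 74.1]
# Cyclo_cmplx_2 correlations; P5 is "?" and is assumed here to rank lowest.
cyclo_cmplx_2 = [0.684, 0.583, 0.481, 0.546, 0.0]

n = len(vm)
rs = spearman_no_ties(vm, cyclo_cmplx_2)
t = rs * sqrt((n - 2) / (1.0 - rs * rs))
p = one_tailed_p_df3(t)
print(f"Rs = {rs:.2f}, t = {t:.3f}, one-tailed p = {p:.3f}")
```

Under these assumptions the result is close to the .05 level listed for Cyclo_cmplx_2 against Vm, but with only five programmers the approximation is rough and an exact permutation test could give a somewhat different figure.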

Table 9. Significance levels for the Spearman rank order correlation
         between the programmers' validity ratios and the correlation
         coefficients for several of the metrics.

                                     Ratio
Metric          Vm           Vt     Vi               Ave(Vm,Vt)  Ave(Vm,Vi)  Ave(Vt,Vi)

E^^                                                  .09         .09
Cyclo_cmplx                                                                  .05
Cyclo_cmplx_2   .05                                  .02         .02
Calls_&_Jumps   .05                                  .02         .02
Source_Lines    .05                                  .02         .02
Source-Cmmts                                         .09         .09
V (B)                                                .09         .09
eta2            [illegible]                          .02         .02
Revisions                    .001   [illegible] (1)  .09         .09

(1) Negative correlation.


In summary, the strongest sets of correlations occur between
the metrics and actual effort for certain validated projects and
for modules totally developed by individual programmers. While
relationships across all projects using both all modules and only
validated modules produce only fair coefficients, the validation
process shows patterns of improvement. Applying the validity
ratio screening to individual projects seems to filter out some
of the project specific interactions while not affecting others,
with the correlations improving accordingly. Two averages of the
validity ratios (Vm with Vt and Vm with Vi) impose a ranking on
the individual programmers that statistically agrees with an ordering
of the improvement of several of the correlations. In all
sectors of the analysis, the inclusion of [illegible] in the Software
Science E metric in its estimators E^ and E^^ seems to improve the
metric correlations with actual effort. The analysis now attempts
to see how well these metrics relate to the number of errors
encountered during the development of software.

B. Metrics' Relation to Errors

This section attempts to determine the correspondence of the
Software Science and related metrics both to the number of
development errors and to the weighted sum of effort required to
isolate and fix the errors. A correlation across all projects of
the Software Science bugs metric B and some of the standard
volume and complexity metrics with errors and weighted errors,
using only newly developed modules, produces the results in Table
10. Most of the correlations are very weak, with the exception
of system changes. These disappointingly low correlations are
attributed to the discrete nature of error reporting and to the
fact that 340 of the 652 modules (52%) have zero reported errors.
Even though these correlations show little or no correspondence,
the following observations indicate potential improvement.
Table 10. Spearman rank order correlations Rs with errors and
          weighted-errors for all modules (652) from six projects.

Key: ?          not significant at .05 level
     *          significant at .05 level
     a          significant at .01 level
     otherwise  significant at .001 level

                Errors   Weighted_Err

E               .083*    .101a
E^              .151     .171
E^^             .163     .186
Cyclo_cmplx     .196     .205
Cyclo_cmplx_2   .189     .200
Calls           .220     .236
Calls_&_Jumps   .235     .248
D1=1/L          ?        ?
D2=1/L^         .124     .140

Source_Lines    .255     .265
Execut_Stmts    .177     .198
Source-Cmmts    .288     .298
V               .168     .186
N               .162     .180
eta1            .102a    .132
eta2            .181     .199

B               .168     .186
B^              .083*    .101a
Revisions       .375     .375
Changes         .677     .636
Weighted_Chg    .627     .677
Design_Eff      .219     .185
Code_Eff        .285     .316
Test_Eff        .149     .164
Tot_Effort      .324     .332

Note: Project S1 has no data to distinguish errors from changes.


Weiss [4], [5] conducted an extensive error analysis that
involved three of the projects and employed enforcement of error
reporting through programmer interviews and hand-checks. For two
of the more recent projects, independent validation and verification
was performed. In addition, the on-site systems development
head asserts that due to the maturity of the collection environment,
the accuracy of the error reporting is more reliable for
the more recent projects [24]. These developmental differences
provide the motivation for an examination of the relationships on
an individual project basis.


Table 11 displays the attributes of the projects and the
correlations of all the metrics vs. errors and weighted errors
for three of the individual projects. The correlations in S7, a
project involved in the Weiss study, are fair but better than
those of project S5 (not shown) that was developed at about the
same time. Projects S4 and S6 (the latter also not shown) have very
poor overall correlations and unreasonably low relationships of
revisions with errors, which point to the effect of being early
projects in the collection effort. The trend that the attributes
produce is not very apparent, although chronology and error
reporting enforcement do seem to have some effect. In another
attempt to improve the correlations, the analysis applies the
previous section's hypothesis of focusing on individual programmers.
Table 12 gives the correlations of the metrics with errors
and weighted errors for modules that two of the individual programmers
totally developed. Even though it is encouraging to see
the correspondences of the metrics B and eta2 with errors as
among the best for programmer P3, the same metrics do not relate
as well for other programmers.


Table 11. Spearman rank order correlations Rs with errors and
          weighted-errors for modules from three individual
          projects.

Key: ?          not significant at .05 level
     *          significant at .05 level
     a          significant at .01 level
     otherwise  significant at .001 level

     Err        errors
     W_err      weighted-errors

Project (#mods)     S3(132)          S4(35)           S7(127)
                  Err    W_err     Err    W_err     Err    W_err

E                 .401   .378      ?      ?         .397   .391
E^                .536   .482      ?      ?         .507   .503
E^^               .579   .522      ?      ?         .492   .505
Cyclo_cmplx       .542   .481      ?      ?         .393   .368
Cyclo_cmplx_2     .553   .489      ?      ?         .405   .400
Calls             .445   .432      .300*  .316*     .423   .419
Calls_&_Jumps     .566   .518      ?      ?         .432   .412
D1=1/L            ?      ?         ?      ?         .168*  .178*
D2=1/L^           .491   .426      ?      ?         .563   .559
Source_Lines      .648   .622      .339*  ?         .490   .487
Execut_Stmts      .538   .505      ?      ?         .478   .465
Source-Cmmts      .599   .568      ?      ?         .501   .483
V                 .541   .495      ?      ?         .461   .456
N                 .526   .480      ?      ?         .457   .449
eta1              .550   .500      ?      ?         .488   .522
eta2              .541   .500      ?      ?         .348   .367
B                 .541   .495      ?      ?         .461   .456
B^                .401   .378      ?      ?         .396   .390
Revisions         .784   .694      .686   .630      .567   .500
Changes           .939   .864      .770   .761      .727   .670
Weighted_Chg      .840   .885      .661   .757      .624   .714
Design_Eff        ?      ?         ?      ?         ?      ?
Code_Eff          .620   .632      .413a  .398a     .274   .264
Test_Eff          .473   .481      .312*  ?         ?      ?
Tot_Effort        .644   .615      .455a  .447a     .253a  .245a

PROJECT ATTRIBUTES

Weiss study, IV & V   (three X marks; column placement illegible)
Chronology        recent           early            middle


Table 12. Spearman rank order correlations Rs with errors and
          weighted-errors for modules totally developed by two
          individual programmers.

Key: ?          not significant at .05 level
     *          significant at .05 level
     a          significant at .01 level
     otherwise  significant at .001 level

     Err        errors
     W_err      weighted-errors

Programmer (#mods)   P2(17)           P3(21)
                  Err    W_err     Err    W_err

E                 .514*  .447*     .368*  ?
E^                .527*  .493*     .600a  .563a
E^^               .515*  .473*     .666   .649
Cyclo_cmplx       .575a  .558a     .463*  .428*
Cyclo_cmplx_2     .661a  .616a     .484*  .449*
Calls             ?      .498*     .506a  .469*
Calls_&_Jumps     .545*  .560a     .598a  .557a
D1=1/L            ?      ?         ?      ?
D2=1/L^           .558a  .526*     .459*  .429*

Source_Lines      ?      ?         .662   .646
Execut_Stmts      .624a  .577a     .579a  .533a
Source-Cmmts      ?      .436*     .635   .594a
V                 .491*  .472*     .679   .655
N                 .494*  .479*     .641   .610a
eta1              .497*  .448*     .611a  .589a
eta2              ?      ?         .715   .717

B                 .491*  .472*     .679   .655
B^                .514*  .447*     .368*  ?
Revisions         ?      ?         .830   .811
Changes           .716   .662a     .855   .828
Weighted_Chg      ?      .510*     .863   .861

Design_Eff        ?      ?         .460*  .392*
Code_Eff          ?      .450*     .699   .667
Test_Eff          ?      ?         .668   .644
Tot_Effort        ?      ?         .668   .624

In summary, partitioning an error analysis by individual 
project or programmer shows improved correlations with the vari- 
ous metrics. Strong relationships seem to depend on the indivi- 
dual programmer, while few high correlations show up on a project 
wide basis. The correlations for the projects reflect the posi- 
tive effects of reporting enforcement and collection process 
maturity. Overall, the correlations with total errors are 
slightly higher than those with weighted errors, while the number 
of revisions appears to relate the best. 

VI. Conclusions

In the Software Engineering Laboratory, the Software Science
metrics, cyclomatic complexity and various traditional program
measures have been analyzed for their relation to effort,
development errors and one another. The major results of this
investigation are the following: 1) None of the metrics examined
seems to manifest a satisfactory explanation of effort spent
developing software or the errors incurred during that process;
2) neither Software Science's E metric, cyclomatic complexity nor
source lines of code relates convincingly better with effort than
the others; 3) the strongest effort correlations are derived when
modules obtained from individual programmers or certain validated
projects are considered; 4) the majority of the effort correlations
increase with the more reliable data; 5) the number of
revisions appears to correlate with development errors better
than either Software Science's B metric, E metric, cyclomatic
complexity or source lines of code; and 6) although some of the
Software Science metrics have size dependent properties with
their estimators, the metric family seems to possess reasonable
internal consistency. These and the other results of this study
contribute to the validation of software metrics proposed in the
literature. The validation process must continue before metrics
can be effectively used in the characterization and evaluation of
software and in the prediction of its attributes.